ABSTRACT 

This study addressed which, if any, contemporary fit 
indices are least susceptible to the bias associated with 
confirmatory factor analysis (CFA) involving a large number of 
measured variables. Data were obtained from student responses from 
1980 to 1990 on the Student Evaluations of Educational Quality (SEEQ) 
instrument of H. Marsh (1987). Factor analytic studies have validated 
the factor structure of the SEEQ. For this study, only student scores 
for 28 SEEQ items (7,407 classes) were included in a CFA model. Fit 
indices evaluated were: (1) the GFI (goodness of fit) index of K. G. 
Joreskog; (2) the Satorra-Bentler chi-square test; (3) Joreskog's 
adjusted goodness of fit index (AGFI); (4) the Bentler-Bonett normed 
fit index (NFI); (5) the comparative fit index (CFI) of P. M. 
Bentler; and (6) the index of L. R. Tucker and C. Lewis (TLI). 
Bentler's CFI, the Bentler-Bonett NFI, and the TLI were highly 
stable within seven factor models varying from 14 to 28 items. 
Because the CFI and TLI have the traditional advantage of protecting 
against the bias associated with large samples, results support their 
routine use as an adjunct to the chi-square test in CFA. Three tables 
present analysis and comparison results. (Contains 13 references.) 
(SLD) 
Numerous investigations have been conducted on dozens of 
proposed indices of goodness-of-fit in confirmatory factor 
analysis. The focus of these investigations has been largely 
limited to sample size (see review and discussions by Bentler, 
1990; Bollen, 1990; Gerbing et al., 1992; Marsh et al., 1988; 
Mulaik et al., 1989). It is well known that a large sample size 
"biases" the chi-square test in a confirmatory analysis in favor 
of model rejection, and researchers have proposed a variety of fit 
indices as solutions to this problem. Indeed, two fit indices, 
the Tucker-Lewis (1973) index, TLI (also called the non-normed fit 
index by Bentler and others), and Bentler's (1990) comparative 
fit index (CFI), appear to have at least partially solved this 
problem (Marsh et al., 1988; Bentler, 1990). 
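
For reference, the incremental indices named above can be computed from the chi-square statistics of the target model and of a baseline (null) model. The sketch below uses the standard textbook definitions of the NFI, TLI, and CFI. The function names and the numbers in the usage lines are hypothetical and are not taken from this study.

```python
def nfi(chi2_null, chi2_model):
    """Bentler-Bonett normed fit index: proportional drop in chi-square."""
    return (chi2_null - chi2_model) / chi2_null


def tli(chi2_null, df_null, chi2_model, df_model):
    """Tucker-Lewis (non-normed) index, based on chi-square/df ratios."""
    ratio_null = chi2_null / df_null
    ratio_model = chi2_model / df_model
    return (ratio_null - ratio_model) / (ratio_null - 1.0)


def cfi(chi2_null, df_null, chi2_model, df_model):
    """Bentler's comparative fit index, based on noncentrality estimates."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, 0.0)
    denom = max(d_null, d_model)
    return 1.0 if denom == 0 else 1.0 - d_model / denom


# Hypothetical chi-square values, for illustration only (not from this study).
example = dict(chi2_null=1000.0, df_null=45, chi2_model=100.0, df_model=35)
nfi_value = nfi(example["chi2_null"], example["chi2_model"])
tli_value = tli(**example)
cfi_value = cfi(**example)
print(round(nfi_value, 3), round(tli_value, 3), round(cfi_value, 3))
```

Note that the TLI and CFI both bring the model's degrees of freedom into the calculation, whereas the NFI does not.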

Much less attention has been paid to another type of "bias" that is 
inherent in a confirmatory factor analysis, although it was 
pointed out some time ago by Fornell (1983). Fornell (1983) 
points out that larger models (those with many items or 
indicators) are more likely to be rejected than smaller models. 
Stated another way, if we were to analyze a 12-item personality 
test with three 4-item dimensions using the 12x12 item covariance 
matrix as input, we would likely reject the model (at least our 
experience suggests that this is the case). In contrast, if we 
were to simplify the aforementioned measurement model by summing 
pairs of items from the same dimension and forming a 6x6 covariance 
matrix of "doublets" (as is often done), we would likely not reject 
the model, or at least its fit would be considerably better than when 
we analyzed a 12x12 matrix. Thus, the problem is that identical 
data can be arbitrarily configured in ways that either support or 
do not support the a priori measurement model. 
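
The reconfiguration of a 12-item matrix into "doublets" can be illustrated as follows. This is a minimal sketch, assuming that each respondent's 12 item scores are stored in order by dimension, so that adjacent items belong to the same dimension. The helper names and the randomly generated toy data are hypothetical.

```python
import random


def make_doublets(rows):
    """Sum adjacent item pairs: 12 item scores become 6 doublet scores."""
    return [[row[i] + row[i + 1] for i in range(0, len(row), 2)] for row in rows]


def covariance_matrix(rows):
    """Unbiased sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
         for k in range(p)]
        for j in range(p)
    ]


# Toy data: 50 respondents, 12 items scored 1 to 5 (illustrative only).
rng = random.Random(0)
items = [[rng.randint(1, 5) for _ in range(12)] for _ in range(50)]
doublets = make_doublets(items)

cov_items = covariance_matrix(items)        # 12x12 input matrix
cov_doublets = covariance_matrix(doublets)  # 6x6 input matrix
print(len(cov_items), len(cov_doublets))
```

Both matrices are derived from exactly the same responses, and only the matrix that is handed to the CFA program differs.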

The problem is even more evident when one considers 
confirmatory factor analysis in light of traditional reliability 
theory. As is well known, increasing the number of indicators 
(e.g., items) is a widely recommended method for assuring high 
reliability. It is almost certain that a longer measure will be 
more reliable than a shorter measure. But in the context of 
confirmatory factor analysis, the opposite is true. Longer 
measures will typically fare worse than shorter measures, at least 
in terms of model fit. One can easily verify this by 
analyzing a full 20-item, unidimensional questionnaire and then 
comparing its fit to that of its two randomly determined halves 
(as determined in two separate confirmatory factor analyses). In 
our experience, the fit of the complete questionnaire will be 
considerably poorer than the fit of either shorter half. 
Perhaps because this tendency is not well known, a cursory review 
of recent confirmatory factor analytic articles indicates that 
the typical researcher readily accepts two- or three-indicator 
confirmatory models without even examining the reliability of the 
constructs that the confirmatory factor analysis "supports." 

Our anecdotal evidence and at least some empirical evidence 
(Hocevar et al., 1984) suggest that the traditional chi-square 
test is strongly biased against models with a large number of 
measured variables. It is reasonable to expect that some 
contemporary fit indices might control for this bias. In our 
estimation, over fifty such indices have been proposed to date. 
For practical reasons, we will limit the present analysis to 
those which are available in two well-known structural equation 
modeling computer programs: EQS Version 4.02 (Bentler, 1993) 
and LISREL 8 (Joreskog & Sorbom, 1993). The issue to be 
addressed in this study is which (if any) contemporary fit 
indices are least susceptible to the bias associated with 
confirmatory factor analysis that involves a large number of 
measured variables. 

Method 

Data were obtained from student responses between 1980 and 
1990 to Marsh's (1987) Students' Evaluations of Educational 
Quality (SEEQ) instrument. The instrument has 41 items, with 
clusters of these items designed to measure nine separate 
dimensions of instructor and course effectiveness. Factor 
analytic studies (e.g., Marsh & Hocevar, 1991) have validated the 
SEEQ factor structure underlying these nine dimensions of teaching and 
course effectiveness. Each SEEQ item was rated on a Likert scale 
from 1 to 5, with higher scores indicating higher ratings of effectiveness. For 
this study, only student scores for 28 SEEQ items were included in 
a confirmatory factor analytic model. The CFA model specified an a 
priori measurement model with seven factors, each having four 
item loadings, as follows: Learning Value (items 1-4), Instructor 
Enthusiasm (items 5-8), Organization/Clarity (items 9-12), Group 
Interaction (items 13-16), Individual Rapport (items 17-20), 
Breadth of Coverage (items 21-24), and Workload/Difficulty (items 
32-35). 

The data were screened with listwise deletion by PRELIS 2 
(Joreskog & Sorbom, 1993), which resulted in a final sample of 
7,407 classes. Item responses in each class were averaged across 
students to create a data matrix with 28 x 7,407 continuous 
elements. Sample covariances were derived from this matrix for 
model estimation using LISREL 8 and EQS Version 4.02. 
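
The class-level averaging step can be sketched as follows. This is an illustration only, assuming each student record is a pair of a class identifier and a complete list of item scores (that is, incomplete records have already been dropped). The function name and the toy records are hypothetical, and the study itself used PRELIS 2 rather than code of this kind.

```python
from collections import defaultdict


def class_means(records, n_items):
    """Average each item across the students within each class.

    records: iterable of (class_id, [item scores]) pairs.
    Returns {class_id: [mean score per item]}.
    """
    sums = defaultdict(lambda: [0.0] * n_items)
    counts = defaultdict(int)
    for class_id, scores in records:
        for j, score in enumerate(scores):
            sums[class_id][j] += score
        counts[class_id] += 1
    return {c: [total / counts[c] for total in sums[c]] for c in sums}


# Toy records with two items (illustrative only).
records = [("A", [5, 4]), ("A", [3, 4]), ("B", [2, 1])]
means = class_means(records, n_items=2)
print(means)
```

Each class then contributes one row of continuous item means to the data matrix from which the sample covariances are computed.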

An initial CFA model with 28 items loading on their 
designated seven separate but intercorrelated factors was 
estimated first. In subsequent runs, the same CFA model was maintained, 
but the number of items per factor was reduced by random 
deletion to 3 and then to 2. Thus, three highly similar CFA 
models with 28 (4x7), 21 (3x7), and 14 (2x7) items were analyzed. 
All model parameters and goodness-of-fit indices were estimated 
by both LISREL 8 and EQS. Because the items exhibited high 
skewness (ranging from -3.1074 to .4512) and high kurtosis 
(ranging from 1.5576 to 15.6020), two methods of estimation were 
used: (a) the maximum likelihood (ML) method in both LISREL 8 and 
EQS, and (b) robust ML in EQS and the LISREL weighted least squares 
(WLS) distribution-free method for non-normal data. Comparing 
the two methods of estimation provided a test of how strongly the 
violation of the normal theory assumption influenced the results 
for these highly non-normal data: normal-theory ML was set against 
the robust ML provided by EQS's Satorra-Bentler scaled chi-square 
and the LISREL asymptotic distribution-free estimation. 
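
The size of the three models can be made concrete by counting degrees of freedom. The sketch below assumes a simple-structure parameterization that is not spelled out in the text: one free loading and one unique variance per item, factor variances fixed for identification, all factor covariances free, and no correlated errors. The function name is hypothetical.

```python
def cfa_degrees_of_freedom(n_items, n_factors):
    """Degrees of freedom for a simple-structure CFA (assumed parameterization)."""
    n_moments = n_items * (n_items + 1) // 2           # unique variances/covariances
    n_loadings = n_items                               # one loading per item
    n_uniquenesses = n_items                           # one error variance per item
    n_factor_covs = n_factors * (n_factors - 1) // 2   # factor intercorrelations
    return n_moments - (n_loadings + n_uniquenesses + n_factor_covs)


df_by_model = {p: cfa_degrees_of_freedom(p, n_factors=7) for p in (14, 21, 28)}
print(df_by_model)
```

Under these assumptions the number of free parameters grows linearly with the number of items, while the number of covariance elements to be reproduced grows quadratically.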
Results 

With the ML method, both LISREL 8 and EQS produced 
consistent results for model fits and parameter estimates. The 
chi-square values were inexplicably somewhat lower with LISREL 8 
than with EQS, but the differences were not substantively meaningful. 
The standard errors estimated by normal-theory ML were biased 
downward by .002 to .007 in comparison to the 
same estimates obtained with the robust ML method in EQS. This 
result confirmed existing research findings (e.g., Muthen & 
Kaplan, 1985) of the downward bias of normal-theory ML when used 
with severely non-normal data. Model fits were improved markedly 
with the Satorra-Bentler scaled statistic under EQS robust ML 
estimation. Thus, the findings discussed below are based on the 
Satorra-Bentler scaled test statistic when possible. 

1. Joreskog's GFI index. Poorer fit for larger models was 
noted on Joreskog's goodness-of-fit index. GFI index values were 
.872, .813, and .740 for the 14, 21, and 28 item models, 
respectively (Table 1). 



Insert Table 1 about here 



2. Satorra-Bentler chi-square test. As predicted, the 
chi-square goodness-of-fit test was strongly "biased" against models 
that included a large number of measured variables. 
Specifically, chi-squares equaled 1,960, 3,918, and 9,042 for the 
14, 21, and 28 item models, respectively (Table 2). 
Insert Table 2 about here 



3. Joreskog's AGFI index. Joreskog's adjusted goodness-of-fit 
index adjusts for degrees of freedom. Thus, we expected that 
this index might not be susceptible to large model bias. This 
expectation was disconfirmed: The AGFI index had values of .760, 
.742, and .679 for the 14, 21, and 28 item models, respectively 
(Table 1). 

4. Bentler-Bonett normed fit index (NFI). The NFI had a 
strong negative monotonic relationship with the number of items. 
For models with 28, 21, and 14 items, the NFI had values of .975, 
.985, and .987 (Table 2) and of .969, .978, and .988 (Table 3). 
Even so, the strong stability of the NFI in models containing different 
numbers of measured variables supports the conclusion that the 
NFI is not biased against larger models. 



Insert Table 3 about here 



5. Bentler's comparative fit index (CFI). The CFI was 
proposed as a way of controlling for the well-known sample size 
bias inherent in the chi-square test. In our analysis, the CFI 
index had a negative monotonic relationship with the number of 
items, but its strong stability (Table 2) in models containing 
different numbers of measured variables supports the conclusion 
that the CFI is not biased against larger models. 



6. Tucker-Lewis index (TLI) (also known as the non-normed 
fit index). The TLI, originally proposed in 1973 by Tucker and 
Lewis, has been more recently advocated by Marsh et al. (1988) as 
a way of controlling for sample size effects. In this study, the 
TLI was the only index that did not have a monotonic 
relationship with the number of items, and, like the CFI, it 
was very stable. Specifically, the TLI had values of .972, .981, 
and .978 (Table 2) and .964, .972, and .964 (Table 3) for 
analyses with 28, 21, and 14 items, respectively. 
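
The degrees-of-freedom adjustment behind the AGFI (result 3) can be checked against the reported GFI values. The sketch below uses the standard relation AGFI = 1 - [p(p + 1) / (2 df)](1 - GFI), where p is the number of items. The degrees of freedom are an assumption: they follow from a simple-structure model with one loading and one unique variance per item and free factor covariances, which the text does not state explicitly. Function names are hypothetical.

```python
def assumed_df(n_items, n_factors=7):
    """Degrees of freedom under an assumed simple-structure CFA."""
    n_moments = n_items * (n_items + 1) // 2
    n_free = 2 * n_items + n_factors * (n_factors - 1) // 2
    return n_moments - n_free


def agfi_from_gfi(gfi, n_items, df):
    """Adjust the GFI for model degrees of freedom."""
    n_moments = n_items * (n_items + 1) / 2
    return 1.0 - (n_moments / df) * (1.0 - gfi)


# GFI and AGFI values as reported in the text (Table 1), keyed by item count.
reported_gfi = {14: 0.872, 21: 0.813, 28: 0.740}
reported_agfi = {14: 0.760, 21: 0.742, 28: 0.679}

computed_agfi = {
    p: agfi_from_gfi(gfi, p, assumed_df(p)) for p, gfi in reported_gfi.items()
}
for p in sorted(computed_agfi):
    print(p, round(computed_agfi[p], 3), reported_agfi[p])
```

Under the assumed degrees of freedom, the computed values agree with the reported AGFI values to within about .001, which is consistent with rounding in the reported GFI. The adjustment changes the level of the index but leaves the downward trend across model sizes in place.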

Conclusion 

As predicted at the outset of this study, models with a 
larger number of items had poorer fits when fit was assessed 
using the chi-square statistic. Neither Joreskog's GFI nor his 
AGFI adequately controlled for the number of items. However, 
Bentler's CFI, the Bentler-Bonett NFI, and the Tucker-Lewis TLI were 
highly stable within seven factor models varying from 14 to 28 
items. Because the CFI and TLI have the traditional advantage of 
protecting against the bias associated with large samples, our 
results support their routine use as an adjunct to the chi-square 
test in confirmatory factor analysis. 
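
The contrast in stability can be summarized directly from the index values quoted in the Results section. The sketch below computes, for each index, the spread (maximum minus minimum) across the 14, 21, and 28 item models and whether the values change monotonically. Only values stated in the text are used (Table 1 for the GFI and AGFI, Table 2 for the NFI and TLI); the helper names are hypothetical.

```python
# Index values for the 14, 21, and 28 item models, as quoted in the text.
reported = {
    "GFI": [0.872, 0.813, 0.740],
    "AGFI": [0.760, 0.742, 0.679],
    "NFI": [0.987, 0.985, 0.975],
    "TLI": [0.978, 0.981, 0.972],
}


def spread(values):
    """Difference between the largest and smallest value, to 3 decimals."""
    return round(max(values) - min(values), 3)


def is_monotonic(values):
    """True if the values never increase or never decrease."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d <= 0 for d in diffs) or all(d >= 0 for d in diffs)


summary = {name: (spread(v), is_monotonic(v)) for name, v in reported.items()}
for name, (s, mono) in summary.items():
    print(f"{name}: spread={s:.3f}, monotonic={mono}")
```

The GFI and AGFI vary by roughly .08 to .13 across model sizes, whereas the NFI and TLI vary by about .01.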